A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake
Abstract
The design of modern CPUs, which combines hierarchical memory with SIMD/vectorization capability, governs the potential for algorithms to be transformed into efficient implementations. The release of AVX-512 changed things radically and motivated us to search for an efficient sorting algorithm that can take advantage of it. In this paper, we describe the best strategy we have found, which is a novel two-part hybrid sort based on the well-known Quicksort algorithm. The central partitioning operation is performed by a new algorithm, and small partitions/arrays are sorted using a branch-free Bitonic-based sort. This study is also an illustration of how classical algorithms can be adapted and enhanced by the AVX-512 extension. We evaluate the performance of our approach on a modern Intel Xeon Skylake and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results demonstrate that our approach is faster than two reference libraries: the GNU C++ sort algorithm by a speedup factor of 4, and the Intel IPP library by a speedup factor of 1.4.
Keywords: Quicksort; Bitonic; sort; vectorization; SIMD; AVX-512; Skylake
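The two-part structure described in the abstract can be illustrated with a minimal scalar sketch. It is not the paper's AVX-512 implementation: `std::partition` stands in for the new vectorized partitioning algorithm, and the names `hybrid_sort`, `small_sort`, `bitonic_sort_fixed`, `compare_exchange`, and the threshold `kSmallSize` are hypothetical choices made for illustration only. The sketch assumes `int` keys and pads small partitions with the maximum `int` value so that a fixed-size bitonic network can be applied.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical threshold (a power of two): partitions of at most this many
// elements are handed to the bitonic network instead of being partitioned again.
constexpr std::size_t kSmallSize = 16;

// Compare-exchange written with min/max instead of a data-dependent branch.
// After the call, a <= b.
inline void compare_exchange(int& a, int& b) {
    const int lo = std::min(a, b);
    const int hi = std::max(a, b);
    a = lo;
    b = hi;
}

// Bitonic sorting network over exactly kSmallSize elements.
// The sequence of compared index pairs depends only on the indices, not on the data.
void bitonic_sort_fixed(int* v) {
    for (std::size_t k = 2; k <= kSmallSize; k <<= 1) {
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            for (std::size_t i = 0; i < kSmallSize; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    if ((i & k) == 0) {
                        compare_exchange(v[i], v[partner]);  // ascending block
                    } else {
                        compare_exchange(v[partner], v[i]);  // descending block
                    }
                }
            }
        }
    }
}

// Sorts n <= kSmallSize elements by padding them to the network size.
void small_sort(int* data, std::size_t n) {
    int buf[kSmallSize];
    for (std::size_t i = 0; i < kSmallSize; ++i) {
        buf[i] = (i < n) ? data[i] : std::numeric_limits<int>::max();
    }
    bitonic_sort_fixed(buf);
    std::copy(buf, buf + n, data);  // the padding sorts to the end and is dropped
}

// Quicksort-style driver: partition large ranges, use the network on small ones.
void hybrid_sort(int* data, std::size_t n) {
    while (n > kSmallSize) {
        // Median-of-three pivot selection (an assumption of this sketch).
        const int a = data[0], b = data[n / 2], c = data[n - 1];
        const int pivot = std::max(std::min(a, b), std::min(std::max(a, b), c));

        // Three-way split: [< pivot][== pivot][> pivot].
        int* end = data + n;
        int* mid1 = std::partition(data, end, [pivot](int x) { return x < pivot; });
        int* mid2 = std::partition(mid1, end, [pivot](int x) { return x == pivot; });

        // Recurse on the left part, continue the loop on the right part.
        hybrid_sort(data, static_cast<std::size_t>(mid1 - data));
        n = static_cast<std::size_t>(end - mid2);
        data = mid2;
    }
    small_sort(data, n);
}
```

A caller would invoke `hybrid_sort(v.data(), v.size())` on a `std::vector<int>`. In a vectorized version, the compare-exchange steps would map naturally onto SIMD min/max operations, which is one reason a bitonic network suits small partitions; the details of how the paper does this are not given in the abstract.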
Similar resources
Fast Sorting Algorithms using AVX-512 on Intel Knights Landing
The modern CPU’s design, which is composed of hierarchical memory and SIMD/vectorization capability, governs the potential for algorithms to be transformed into efficient implementations. The release of the AVX-512 changed things radically, and motivated us to search for an efficient sorting algorithm that can take advantage of it. In this paper, we describe the best strategy we have found, whi...
Intel Skylake review
The Skylake Xeon (called Xeon Scalable processor) introduced a number of innovations, notably AVX-512 vectorization instructions capable of 8-wide double-precision vectors (the previous AVX2 had 4-wide DP vectors). This change in itself has the potential to double the performance of floating-point codes. Other changes include CPU core optimizations, rearchitecture of the caches, and new, mesh-based top...
Extreme Scale FMM-Accelerated Boundary Integral Equation Solver for Wave Scattering
Abstract. Algorithmic and architecture-oriented optimizations are essential for achieving performance worthy of anticipated energy-austere exascale systems. In this paper, we present an extreme scale FMM-accelerated boundary integral equation solver for wave scattering, which uses FMM as a matrix-vector multiplication inside the GMRES iterative method. Our FMM Helmholtz kernels are capable of t...
Landau Collision Integral Solver with Adaptive Mesh Refinement on Emerging Architectures
The Landau collision integral is an accurate model for the small-angle dominated Coulomb collisions in fusion plasmas. We investigate a high order accurate, fully conservative, finite element discretization of the nonlinear multi-species Landau integral with adaptive mesh refinement using the PETSc library (www.mcs.anl.gov/petsc). We develop algorithms and techniques to efficiently utilize emer...
Implementing BLAKE with AVX, AVX2, and XOP
In 2013 Intel will release the AVX2 instructions, which introduce 256-bit single-instruction multiple-data (SIMD) integer arithmetic. This will enable desktop and server processors from this vendor to support 4-way SIMD computation of 64-bit add-rotate-xor algorithms, as well as 8-way 32-bit SIMD computations. AVX2 also includes interesting instructions for cryptographic functions, like any-to-a...